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Abstract —We design an algorithmic framework using matrix 
exponentials for time-domain simulation of power delivery net¬ 
work (PDN). Our framework can reuse factorized matrices to 
simulate the large-scale linear PDN system with variable step- 
sizes. In contrast, current conventional PDN simulation solvers 
have to use fixed step-size approach in order to reuse factorized 
matrices generated by the expensive matrix decomposition. Based 
on the proposed exponential integration framework, we design a 
PDN solver R-MATEX with the fiexible time-stepping capability. 
The key operation of matrix exponential and vector product 
(MEVP) is computed by the rational Krylov subspace method. 

To further improve the runtime, we also propose a distributed 
computing framework DR-MATEX, DR-MATEX reduces Krylov 
subspace generations caused by frequent breakpoints from a 
large number of current sources during simulation. By virtue 
of the superposition property of linear system and scaling 
invariance property of Krylov subspace, DR-MATEX can divide 
the whole simulation task into subtasks based on the alignments 
of breakpoints among those sources. The subtasks are processed 
in parallel at different computing nodes without any commu¬ 
nication during the computation of transient simulation. The 
final result is obtained by summing up the partial results among 
all the computing nodes after they finish the assigned subtasks. 
Therefore, our computation model belongs to the category known 
as Embarrassingly Parallel model. 

Experimental results show R-MATEX and DR-MATEX can 
achieve up to around 14.4x and 98.0x runtime speedups over 
traditional trapezoidal integration based solver with fixed time- 
step approach. 
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I. Introduction 

M odern VLSI design verification relies heavily on 
the analysis of power delivery network (PDN) to 
estimate power supply noises Q-IHl- The performance of 
power delivery network highly impacts on the quality of 
global, detailed and mixed-size placement na-CD, clock tree 
synthesis ca, global and detailed routing ca, as well as tim¬ 
ing ifTHl and power optimization. Lowering supply voltages, 
increasing current densities as well as tight design margins 
demand more accurate large-scale PDN simulation. Advanced 
technologies ifTSll . O, three dimensional (3D) IC structures 
uni-iini, and increasing complexities of system designs all 
make VLSI PDNs extremely huge and the simulation tasks 
time-consuming and computationally challenging. Due to the 
enormous size of modern designs and long simulation runtime 
of many cycles, instead of general nonlinear circuit simulation 
1^ . 1^ . PDN is often modeled as a large-scale linear circuit 
with voltage supplies and time-varying current sources ll22ll - 
1^ . Those linear matrices are obtained by parasitic extraction 
process H, After those processes, we need time- 

domain large-scale linear circuit simulation to obtain the 
transient behavior of PDN with above inputs. 

Traditional methods in linear circuit simulation solve differ¬ 
ential algebra equations (DAE) numerically in explicit ways, 
e.g., forward Euler (EE), or implicit ways, e.g., backward 
Euler (BE) and trapezoidal (TR), which are all based on low 
order polynomial approximations for DAEs ll29ll . Due to the 
stiffness of systems, which comes from a wide range of time 
constants of a circuit, the explicit methods require extremely 
small time step sizes to ensure the stability. In contrast, implicit 
methods can handle this problem with relatively large time 
steps because of their larger stability regions. However, at each 
time step, these methods have to solve a linear system, which 
is sparse and often ill-conditioned. Due to the requirement of 
a robust solution, compared to iterative methods EqI, direct 
methods ED are often favored for VLSI circuit simulation, 
and thus adopted by state-of-the-art power grid (PG) solvers 
in TAU PG simulation contest |[3^ - |[34ll . Those solvers only 
require one matrix factorization (LU or Cholesky factorization) 
at the beginning of the transient simulation. Then, at each fixed 
time step, the following transient computation requires only 
pairs of forward and backward substitutions, which achieves 
better efficiency over adaptive stepping methods by reusing the 


factorization matrix ||24l, ||32l, ll^ in their implicit numerical 
integration framework. However, the maximum of step size 
choice is limited by the smallest distance hupper among the 
breakpoints (351 . Some engineering efforts are spent to break 
this limitation by sacrificing the accuracy. In our work, we 
always obey the upper limit hupper of time step to maintain 
the fidelity of model, which means the fixed time step h cannot 
go beyond hupper in case of missing breakpoints. 

Beyond traditional methods, a class of methods called 
matrix exponential time integration has been embraced by 
MEXP (361 . The major complexity is caused by matrix 
exponential computations. MEXP utilizes standard Krylov 
subspace method (TTl to approximate matrix exponential and 
vector product. MEXP can solve the DAEs with much higher 
order polynomial approximations than traditional ones (36l . 
(33. Nevertheless, when simulating stiff circuits with standard 
Krylov subspace method, it requires the large dimension of 
subspace in order to preserve the accuracy of MEXP approxi¬ 
mation and poses memory bottleneck and degrade the adaptive 
stepping performance of MEXP. 

Nowadays, the emerging multi-core and many-core plat¬ 
forms bring powerful computing resources and opportunities 
for parallel computing. Even more, cloud computing tech¬ 
niques (^ drive distributed systems scaling to thousands 
of computing nodes (^ - (4T1 . etc. Distributed computing 
systems have been incorporated into products of many leading 
EDA companies and in-house simulators (43 - (46l . How¬ 
ever, building scalable and efficient distributed algorithmic 
framework for transient linear circuit simulation is still a 
challenge to leverage these powerful computing tools. The 
papers EH, EH show great potentials by parallelizing matrix 
exponential based method to achieve the runtime performance 
improvement and maintain high accuracy. 

In this work, we develop a transient simulation framework 
using matrix exponential integration scheme, MATEX, for 
PDN simulation. Eollowing are the challenges we need to 
address. Eirst, when the circuit is stiff, the standard Krylov 
subspace has convergence problem and slows down the com¬ 
putation of MEVP. Second, the frequent time breakpoints due 
to the transitions of PDN current sources modeling triggers 
the generations of Krylov subspace. Therefore, we might gain 
performance where we leverage the large time stepping, but 
we also lose runtime for the small step size. Our contributions 
are listed as below: 

• MEVP in MATEX is efficiently computed by rational or 
invert Krylov subspace method. Compared to the com¬ 
monly adopted framework using TR with fixed time step 
(TR-ETS), the proposed MATEX can reuse factorized 
matrix at the beginning of transient simulation to perform 
flexible adaptive time stepping. 

• Among different Krylov subspace methods, we find ra¬ 
tional Krylov subspace is the best strategy for MEVP in 
PDN simulation. Therefore, we design R-MATEX based 
on that and achieve up to around 15 x runtime speedup 
against the benchmarks over the traditional method TR- 
ETS with good accuracy. 

• Eurthermore, DR-MATEX is designed to improve R- 
MATEX with distributed computing resources. 


- Eirst, PDN’s current sources are partitioned into 
groups based on their alignments. They are assigned 
to different computing nodes. Each node runs its 
corresponding PDN transient simulation task and has 
no communication overhead with other nodes. 

- After all nodes finish the simulation computations, 
the results are summed up based on the linear 
superposition property of the PDN system. 

- Proposed current source partition can reduce the 
chances of generating Krylov subspaces and prolong 
the time periods of reusing computed subspace at 
each node, which brings huge computational advan¬ 
tage and achieves up to 98 x speedup over traditional 
method TR-ETS. 

The rest of this paper is organized as follows. Sec. |II| 
introduces the background of linear circuit simulation and 
matrix exponential formulations. Sec. Hill illustrates the Krylov 
techniques to accelerate matrix exponential and vector product 
computation. Sec. hy] presents MATEX circuit solver and 
the parallel framework DR-MATEX. Sec. [V| shows numerical 
results and Sec. [Vl| concludes this paper. 

H. Background 

A. Transient Simulation of Linear Circuit 

Transient simulation of linear circuit is the foundation 
of modern PDN simulation. It is formulated as DAEs via 
modified nodal analysis (MNA), 

Cx(t) = —Gx(t) + Bu(t), (1) 

where C is the matrix for capacitive and inductive elements. 
G is the matrix for conductance and resistance, and B is 
the input selector matrix. x(t) is the vector of time-varying 
node voltages and branch currents. u{t) is the vector of supply 
voltage and current sources. In PDN, such current sources are 
often characterized as pulse or piecewise-linear inputs (22ll . 
(^ to represent the activities under the networks. To solve 
Eq. O numerically, the system is discretized with time step h 
and transformed to a linear algebraic system. Given an initial 
condition x(0) from DC analysis or previous time step x(t) 
and a time step h, + h) can be obtained by traditional low 
order approximation methods (29]|. 

B. Traditional Low Order Time Integration Schemes 

1) BE: Backward Euler based time integration scheme 
(Eq.O) is a robust implicit first-order method. 

+ G^ x(t A-h) = + Bu(t + h). (2) 

2) TR: Trapezoidal based time integration scheme (Eq.©) 
is a popular implicit second-order method. 

u{t) + u{t + h) 

2 

It is probably the most commonly used strategy for large-scale 
circuit simulation, which has higher accuracy than BE. 
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3) BE-FTS and TR-FTS: Methods BE and TR with fixed 
time step (FTS) h are efficient approaches, which were adopted 
by the top PG solvers in 2012 TAU PG simulation contest ||24l, 
ll^ - |[34ll . If only one h is used for the entire simulation, the 
choice is limited by the minimum breakpoint ll35]| distance 
hupper among all the input sources. Fig. [T] (a) has lOps as 
the upper limit for h in BE-FTS and TR-FTS. When the 
alignments of inputs change (as shown in Fig. [T] (b)) shift 
by bps, the resulting upper limit for h becomes bps for the 
approaches with fixed step size. If h is larger than the limit, 
it is impossible to guarantee the accuracy since we may skip 
pivot points of the inputs. 


ically, turning the solution with matrix exponential operator: 


x(f -1- /i) = — I A ^b(t + /i) + A 


_2 b(t 3-h) — b(t) 


hA 


x(t) + A u{t) + A' 


h 

_2 b(t h) — h{t) 
h 


. ( 6 ) 


For the time step choice, breakpoints (also known as input 
transition spots (TS) lEl) are the time points where slopes of 
input vector change. Therefore, for Eq. (|6]), the maximum time 
step starting from t is {tg — t), where tg is the smallest one 
in T5' larger than t. In matrix exponential based framework, 
the limitation of time step size is not the local truncation error 
(LTE), but the activities among all input sources. 




Fig. 1. Example: Interleave two input sources to create smaller transition time. 
( a) Before interleaving, the smallest transition time of the input sources is 
hupper = lOps; (b) After interleaving, the smallest transition time of the 
input sources is hupper = 5ps. 


III. Krylov Subspace Methods for Matrix 
Exponential and Vector Product (MEVP) 

In PDN simulation, A is usually above millions and makes 
the direct computation of matrix exponential infeasible. 
The alternative way to compute the product is through Krylov 
subspace method ij/i . In this section, we first introduce the 
background of standard Krylov subspace for MEVP. Then, 
we discuss invert (I-MATEX) and rational Krylov subspace 
(R-MATEX) methods, which highly improve the runtime 
performance for MEVP. 

A. MEXP: MEVP Computation via Standard Krylov Subspace 
Method 

The complexity of e^v can be reduced using Krylov 
subspace method and still maintained in a high order poly¬ 
nomial approximation (371. MEXP (361 uses standard Krylov 
subspace, which uses A directly to generate subspace basis 
through Arnoldi process (Algorithm [T]). First, we reformulate 
Eq. m into 


C. Matrix Exponential Time Integration Scheme 

The solution of Eq. o can be obtained analytically (29l. 
For a simple illustration, we convert Eq. (ID into 


x(t -t- /i) = e^^(x(t) + F(t, h)) — P(t, h), 


where 


= A-ib(t) + A-2 


h{t + h)- h{t) 
h 


x(f) = Ax(t) + b(t), (4) 

when C is not singulaiQ, 

A = -C-^G , and h{t) = C-^Bu(t). 

Given a solution at time t and a time step h, the solution at 
t^his 


rh 

x(t + h) = e^^x(f) -h / -b r)dr. 

Jo 


(5) 


Assuming that the input u(t) is a piecewise linear (PWL) 
function of f, we can integrate the last term of Eq. ([5]) analyt- 


^Th e assumption is to simplify the explanation in this section. After Sec. 
IIII-BI we use I-MATEX, R-MATEX and DR-MATEX to compute the solution 
of DAE without inversion of C. Therefore, the methods are suitable for 
general DAE system, i.e., Eq. {TJ without the assumption here. 


and 


P(f,/i)=A ^h{th)A 
The standard Krylov subspace 


_2 h{t Ah) — h{t) 
h 


Ki„(A, v) := span{v, Av, ■ ■ ■ , A™ V} 
obtained by Arnoldi process has the relation 
AVjn VjnHjn “t“ 


where is the upper Hessenberg matrix 



( ^1,1 

^1,2 • • • 

1 

hl^rn ^ 


^2,1 

^2,2 

^2,m-l 

h2,m 

H^ = 

0 

^3,2 

^3,m —1 

hs^rn 


V 0 

0 ••• 

hm,m—l 

hm,m / 


(7) 

( 8 ) 

(9) 

( 10 ) 

( 11 ) 


( 12 ) 
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Vm is a n X m matrix by (vi, V 2 , • • • , v^), and is the 
m-th unit vector with dimension n x 1. MEVP is computed 
via 

fa(13) 

The posterior error term is 

rm{h) = ||/3/im+i,mVm+ie’'’e'‘**’"ei||, (14) 

where f3 = ||v||. However, for an autonomous system Cx(t) = 
—Gx(t) in circuit simulation, we consider the residual be¬ 
tween Cx(f) and — Gx(f), which is 

Cx(t) + Gx(f), 

instead of 

x(t) — Ax(f) 

in x(t) = Ax(f). This leads to 

r{m,h) = \\l3hm+i,mCvm+ie^e'^^’^ei\\ (15) 

and helps us mitigate the overestimation of the error bound. 
To generate x(/: + h) by Algorithm [T] we use 

[L, U] = LU_Decompose(Xi), (16) 

where 

Xi = C and X 2 = G 

as inputs for standard Krylov subspace. The error budget e and 
Eq. (O are used to determine the convergence when time step 
is h and Krylov subspace dimension is j (from line [TT] to line 
fTHin Algorithm [T]). 


Algorithm 1: MATEX Arnoldi 


Input: L, U, X 2 , x(t), e, P(t, h), F(f, h) 

Output: x(f -f h),Yrn, H, V 
1 V = x(t) + F(t, h); 

^ = M’ 

3 for j = 1 : m do 

w = U\(L\(X 2 V,)) ; / -k a pair of forward 
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6 

7 

8 
9 

10 
11 
12 

13 

14 

15 end 


and backward substitutions. 

for i = 1 : j do 

hi^j = w^Vj; 


*/ 


w = w - hijWi 

end 

hj+i,j = ||w||; 

'^i+i = 

if r(j, h) < € then 


m = j\ 
break; 


end 


16 x(t + h) = ||v||Vme'‘”ei - F{t, h)\ 


The standard Krylov subspace may not be efficient when 
simulating stiff circuits 13611 . l49l . Eor the accuracy of ap¬ 
proximation of e^v, a large dimension of Krylov subspace 
basis is required, which not only brings the computational 
complexity but also consumes huge amount of memory. The 


reason is that the Hessenberg matrix of standard Krylov 
subspace tends to approximate the large magnitude eigenvalues 
of A ISOll . Due to the exponential decay of higher order 
terms in Taylor’s expansion, such components are not the 
crux of circuit system’s behavior ISOt . ISH . Therefore, to 
simulate stiff circuit, we need to gather more vectors into 
subspace basis and increase the size of to fetch more 
useful components, which results in both memory overhead 
and computational complexity to Krylov subspace generations 
for each time step. In the following subsections, we adopt 
ideas from spectral transformation Eol, ED to effectively 
capture small magnitude eigenvalues in A, leading to a fast 
and accurate MEVP computation. 

B. I-MATEX: MEVP Computation via Invert Krylov Subspace 
Method 

Instead of A, we use A“^ (or G“^C) as our target matrix 
to form 

K„,(A-\v) ;= span{v,A-W,-- - ,A-(™-iV}. (17) 

Intuitively, by inverting A, the small magnitude eigenvalues 
become the large ones of A“^. The resulting is likely to 
capture these eigenvalues first. Based on Arnold! algorithm, 
the invert Krylov subspace has the relation 

A Vm — YjYiELjYi -j- (18) 

The matrix exponential is calculated as 

e^v ^ ei. (19) 

To put this method into Algorithm 1 is just by modifying the 
inputs Xi = G for the LU decomposition in Eq. (O, and 
X 2 = C. In the line fT^ of Algorithm [TJ 


for the invert Krylov version. The posterior error approxima¬ 
tion II 47 II is 

rm{h) = ||/3ft.m+i,TOAv„+ie^H;;^e'‘“ei||, (20) 

which is derived from residual based error approximation 
in ED- However, as mentioned in Sec. IIII-Ai we consider 
the residual of (Cx(t) + Gx(f)), instead of (x(f) — Ax(f)), 
which leads to 

r{m,h) = (21) 

We use Eq. (1211) for the line [U] of Alg. [T] 

C. R-MATEX: MEVP Computation via Rational Krylov Sub¬ 
space Method 

The shift-and-invert Krylov subspace basis tSOll is designed 
to confine the spectrum of A. Then, we generate Krylov 
subspace via 

K^((I-7 A)-\v) := (22) 

span{v, (I - 7 A)-W, ■ ■ ■ , (I - 7 A)-(”*-i)v}, 
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where 7 is a predefined parameter. With this shift, all the 
eigenvalues’ magnitudes are larger than one. Then the inverse 











limits the magnitudes smaller than one. According to EqI, 
ED, the shift-and-invert basis for matrix exponential based 
transient simulation is not very sensitive to 7 , once it is 
set to around the order near time steps used in transient 
simulation. The similar idea has been applied to simple power 
grid simulation with matrix exponential method (521. Here, 
we generalize this technique and integrate into MATEX. The 
Arnoldi process constructs Vm and Hm- We have 

(I ~ m^m H” • (^3) 

We can project the onto the rational Krylov subspace as 
follows. 

(24) 

In the line [16] of Algorithm [T] 

T — 

7 

Following the same procedure ED, ED, the posterior error 
approximation is derived as 

rm{h) = ||. (25) 

7 

Note that in practice, instead of computing (I—directly, 
(C + 7 G)“^C is utilized. The corresponding Arnoldi process 
shares the same skeleton of Algorithm [T] with input matrices 

Xi = (C + 7 G) 

for the LU decomposition Eq. (O, and 
X 2 = C. 

The residual estimation is 

r{m,h) = (26) 

7 

Then, we plug Eq. (1261) into the line [TT] of Algorithm [T] 

D. Regularization-Free MEVP Computation 

When C is a singular matrix, MEXP (361 needs the regular¬ 
ization process (5^ to remove the singularity of DAE in Eq. 
([T]). It is because MEXP needs factorize C directly to form 
the input Xi for Algorithm 1. This brings extra computational 
overhead when the case is large (5^ . It is not necessary if 
we can obtain the generalized eigenvalues and corresponding 
eigenvectors for matrix pencil (—G,C). Based on (54l . we 
derive the following lemma. 

Lemma 1. Considering a homogeneous system 

Cx = —Gx. 

u and X are the eigenvector and eigenvalue of matrix pencil 
(-G,C), then 


Proof. H If A and u are an eigenvalue and eigenvector of a 
generalized eigenvalue problem 

-Gu = ACu. 

Then, x = is the solution of Cx = — Gx. □ 

Because we do not need to compute explicitly during 
Krylov subspace generation, I-MATEX and R-MATEX are 
regularization-free. Instead, we factorize G for invert Krylov 
subspace basis generation (I-MATEX), or (C + 7 G) for 
rational Krylov subspace basis (R-MATEX) [j Besides, their 
Hessenberg matrices Eq. (Ha are invertible, which contain 
corresponding important generalized eigenvalues/eigenvectors 
from matrix pencil (—G, C), and define the behavior of linear 
dynamic system in Eq. (ID of interest. 

E. Comparisons among Different Krylov Subspace Algorithms 
for MEVP Computation 

In order to observe the error distribution versus dimensions 
of standard, invert, and rational Krylov subspace methods for 
MEVP, we construct a RC circuit with stiffness 

= 4 7 X 10® 

Re{\ma.) ’ 

where Xmax = —8.49 x 10^^ and Xmin = —3.98 x 10^^ are 
the maximum and minimum eigenvalues of A = —C“^G. 
Fig. [ 2 ] shows the relative error reductions along the increasing 
Krylov subspace dimension. The error reduction rate of ra¬ 
tional Krylov subspace is the best, while the one of standard 
Krylov subspace requires huge dimension to capture the same 
level of error. For example, it costs almost 10 x of the size 
to achieve around relative error 1% compared to Invert and 
Rational Krylov subspace methods. The relative error is 

I |e^^v — 11 

||e^^v|| 

where h = 0.4ps, 7 = 10“^^. The matrix A is a relatively 
small matrix and computed by MATLAB expm function. The 
result of serves as the baseline for accuracy. The relative 
error is the real relative difference compared to the analytical 
solution of the ODE 


with an initial vector v, which is generated by MATLAB rand 
function. 

The error reduction rate of standard Krylov subspace is the 
worst, while the rational Krylov subspace is the best. It is the 
reason that we prefer rational Krylov subspace (R-MATEX). 
The relative errors of BE, TR and EE are 0.0594, 0.4628, and 
2.0701 X 10^, respectively. The large error of EE is due to 
the instability issue of its low order explicit time integration 
scheme. In Fig. |2l when m = 3, standard, invert and rational 
Krylov subspace methods have 0.8465, 0.0175, and 0.0065, 
respectively. It illustrates the power of matrix exponential 


is a solution of the system. 
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^We repeat the proof from with some modifications for our formula¬ 
tion. 

^It is also applied to the later work of DR-MATEX in Sec. liv^ 









Fig. 2. The relative error vs. dimensional m of different Krylov subspace 


methods. The relative error is 


e^^v-/3Vr, 


phA 




-y = 10 13 Note: The relative error is the difference compared to analytical 
solution e^^'v of the ODE ^ = Ax with an initial vector v, which is 
generated by MATLAB rand function, and its entries are positive numbers 
in (0,1]. 



-^Vrr 



, where h = OAps, 


Fig. 4. The error 

Wf^hA^W 


of MEVP via standard Krylov Subspace: 
vs. time step h and dimension of standard 
Krylov subspace basis (m). The standard Krylov subspace approximates 
the solution well in extremely small h, since it captures the important 
eigenvalues and eigenvectors of A at that region. However, the small h is 
not useful for the circuit simulation. For large h, it costs large m to reduce 
the error. 


Fig. 3. The relative error vs. dimension m of different Krylov subspace 

—^1^, where h = OAps, 


methods. The relative error is ^ 

7 = 10“^^. The rational Krylov subspace has very stable error reduction 
rate. The number in the bracket represents the stiffness value of the system. 


method. Our proposed methods are all stable and can achieve 
improved error numbers. 

In order to observe the different stiffness effects on Krylov 
subspace methods, we change the entries in C and G to make 
the different stiffness value 4.7 x 10^^. Fig. [3] illustrates the 
stable reduction rate of rational method. The stiffness degrades 
the performance of standard Krylov subspace method. Both in¬ 
vert and rational Krylov subspace methods are good candidates 
for stiff circuit system. 

Regarding the relative error distributions vs. time step h and 
dimension m, Fig. SI Fig. 13 and Fig. [ 6 | are computed by stan¬ 
dard, invert, and rational Krylov subspaces (7 = 5 x 10“^^), 
respectively. Fig. [4] shows that the errors generated by standard 
Krylov subspace method has flat region with high error values 
in time-step range of interests. The small (‘unrealistic’) time 
step range has small error values. Compared to Fig. [H invert 
(Fig. [ 5 ]) and rational (Fig. [6]) Krylov subspace methods reduce 
errors quickly for large h. The explanation is that a relatively 



Fig. 5. 


The error of MEVP via invert Krylov Subspace: 


^^ VS- step h and dimension of invert Krylov 
subspace basis (m). Compared to Fig. |4] invert Krylov subspace method 
reduces the errors for large h. 


small portion of the eigenvalues and corresponding invariant 
subspaces determines the flnal result (vector) when time step 
h is larger ll50l . which are efficiently captured by invert and 
rational Krylov subspace methods. 

The error of rational Krylov subspace is relatively insensi¬ 
tive to 7 when it is selected between the time-step range of 
interests (Fig. [7]). Above all, rational Krylov (R-MATEX) and 
invert Krylov (I-MATEX) subspace methods have much better 
performance than standard version. When we deal with stiff 
cases, standard Krylov subspace is not a feasible choice due 
to the large dimension m of Krylov subspace, which causes 
huge memory consumption and poor runtime performance. 
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Algorithm 2: MATEX Circuit Solver 
Input: C, G, B, u, e, and time span T. 

Output: The set of x from [0,T]. 

1 Set Xi, X 2 ; 

2 f = 0; 

3 x(t) =DC_analysis; 

4 [L, U] = LU_Decompose(Xi); 

5 while t < T do 

6 Compute maximum allowed step size h; 

7 Update P(t,/i), F(t,/i); 

8 Obtain x(t + /i) by Algorithm [T] with inputs 
[L, U, X2, /i, t, x{t), e, P(t, h), F(t, h)]; 

9 t = t h", 

10 end 


Fig. 6 . The error of MEVP via rational Krylov Subspace: 

II ^ where 7 = 5 x 10“^^, vs. time step h and 

||eh,Av|| ’I 

dimension of rational Krylov subspace basis (m). Compared to Fig.|3 rational 
Krylov subspace method reduces the errors for large h as Fig. |5] 



Fig. 7. The error of MEVP via rational Krylov Subspace 

^ where h = 4ps. The flat region shows 
the error is actually relatively insensitive to 7 , when 7 is in the range of step 
size h of interests. 


IV. MATEX Framework 
A. MATEX Circuit Solver 

We incorporate matrix exponential based integration scheme 
with Krylov subspace method into our MATEX framework, 
which is summarized in Algorithm [2l We set Xi and X 2 in 
Line [T] based on the choice of Krylov subspace method as 
follows, 

. I-MATEX: Xi = G, X2 = G 

. R-MATEX: Xi = G + 7G, X2 = G 

For linear system of PDN, the matrix factorization in line 
[ 4 ] is only performed once, and the matrices L and U are 
reused in the while loop from line [5] to line [TOl Line [8] uses 
Arnoldi process with corresponding inputs to construct Krylov 
subspace as shown in Algorithm [T] 


B. DR-MATEX (Distributed R-MATEX Eramework) by De¬ 
composition of Input Sources, Linear Superposition, and Par¬ 
allel Computation Model 

1) Motivation: 

There are usually many input sources in PDNs as well as 
their transition activities, which might narrow the regions for 
the stepping of matrix exponential based method due to the 
unaligned breakpoints. In other words, the region before the 
next transition is may be shortened when there are a lot of 
activities from the input sources. It leads to more chances of 
generating new Krylov subspace bases. We want to reduce 
the number of subspace generations and improve the runtime 
performances 

2 ) Treatment and Methodology: 

In matrix exponential based integration framework, we can 
choose any time spot t ^ h G with computed Krylov 

subspace basis. The solution of x(fH-/i) is computed by scaling 
the existing Hessenberg matrix H with the time step h as 
below 

x{t + h) = \\v\\'Vme'^^ei-P{t,h). (27) 

This is an important feature for computing the solutions 
at intermediate time points without generating the Krylov 
subspace basis, when there is no current transition. Besides, 
since the PDN is linear dynamical system, we can utilize 
the well-known superposition property of linear system and 
distributed computing model to tackle this challenge. 

To illustrate our distributed version of MATEX framework, 
we first define three terms to categorize the breakpoints of 
input sources: 

• Local Transition Spot (LTS): the set of T5' at an input 
source to the PDN. 

• Global Transition Spot (GTS): the union of LTS among 
all the input sources to the PDN. 

• Snapshot: a set GTS \ LTS at one input source. 

If we simulate the PDN with respect to all the input sources, 
the points in the set of GTS are the places where generations 

^The breakpoints also put the same constraint on TR-FTS and BE-FTS. 
However, their time steps are flxed aheady, which refrains them from reaching 
this problem in the first place. 
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Fig. 8. Part of a PDN model with input sources from Fig. [9| 

of Krylov subspace cannot be avoided. For example, there are 
three input sources in a PDN (Fig. [8]). The input waveforms are 
shown in Fig.[9j The first line is GTS, which is contributed by 
the union of LTS in input sources #1, #2 and #3. However, 
we can partition the task into subtasks by simulating each 
input source individually. Then, each subtask generates Krylov 
subspaces based on its own LTS and keeps track of Snapshot 
for the later usage of summation via linear superposition. 
Between two LTS points t and t T h, the Snapshot points 

t hi < t + /i2 t hi G (t^t h] 

can reuse the Krylov subspace generated at t. For each node, 
the chances of generation of Krylov subspaces are reduced. 
The time periods of reusing latest Krylov subspaces are 
enlarged locally and bring the runtime improvement. Besides, 
when subtasks are assigned, there is no communication among 
the computing nodes, which leads to so-called Embarrassingly 
Parallel computation model. 



Fig. 9. Illustration of input transitions. GTS: Global Transition Spots; LTS: 
Local Transition Spots; Snapshots: the crossing positions by dash lines and 
LTS #k without solid points. 

3) More Aggressive Tasks Decomposition: We divide the 
simulation task based on the alignments of input sources. 
More aggressively, we can decompose the task according to 
the “bump” shapes of the input sources 0 We group the input 
sources, which have the same 


{tdelay-) trise^ tfalh l^width) 

Tbm power grid benchmarks provide the pulse input model in SPICE 
format. 


Global Transition Spots (GTS) 



Local Transition Spots 
(LTS) at #1.1 in Group 1 


Local Transition Spots 
(LTS) at #1.2 in Group 4 


Local Transition Spots 
(LTS) at #2.1 in Group 2 


Local Transition Spots 
(LTS) at #2.2 in Group 3 


Local Transition Spots 
(LTS) at #3 in Group 4 


Fig. 10. Grouping of “Bump” shape transitions for sub-task simulation. 
The matrix exponential based method can utilize adaptive stepping in each 
LTS and reuse the Krylov subspace basis generated at the latest point in 
LTS. However, traditional methods (TR, BE, etc.) still need to do time 
marching, either by pairs of forward and backward substitutions and proceed 
with hxed time step, or by re-factorizing matrix and solving linear system for 
adaptive stepping. (Pulse input information: delay time; frzse- rise 

time; t^idth- pulse width; tfaih fall time; and period). 


into one set. For example, the input source #1 of Fig. [9] is 
divided to #1.1 and #1.2 in Fig. [TOl The input source #2 
in Fig. [9] is divided to #2.1 and #2.2 in Fig. [TOl Therefore, 
there are four groups in Fig.[T0l Group 1 contains LTSij^l.l. 
Group 2 contains I/TS'#2.1. Group 3 contains I/TS'#2.2. 
Group 4 contains LT5'#1.2 and #3. Our proposed framework 
MATEX is shown in Fig. [TT] After pre-computing GTS and 
decomposing LTS based on “burm” shape (Fig.fTOb. we group 
them and form LTS #1 ^ #^|j 

4) MATEX Scheduler in DR-MATEX: 

In DR-MATEX, the role of MATEX scheduler is just to 
send out GTS and LTS to different MATEX slave nodes 
and collect final results after all the subtasks of transient 
simulation are finished. The node number is based on the 
total number of subtasks, which is the group number after 
PDN source decomposition. Then the simulation computations 
are performed in parallel. Each node has its own inputs. Eor 
example, Node#A; has GTS, LTS#k, Pk and F/^, which 
contain the corresponding b for node k. Scheduler does not 
need to do anything during the transient simulation, since there 
are no communications among nodes before the stage of “write 
back” (in Eig. Ell, by when all nodes complete their transient 
simulations. 

Within each slave node, the circuit solver (Algorithm O 
computes transient response with varied time steps. Solutions 
are obtained without re-factorizing matrix during the computa¬ 
tion of transient simulation. The computing nodes write back 
the results and inform the MATEX scheduler after finishing 
their own transient simulation. 

^There are alternative decomposition strategies. It is also easy to extend 
the work to deal with different input waveforms. We try to keep this part as 
simple as possible to emphasize our framework. 
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MATEX Slave Node #1: 
Circuit (Local Copy) 


Global Transitions Spots 
(Local Copy of GTS) 


Local Transition Spots 
(LTS) #1 


Circuit Solver 


Local Results 


wri^ back 


MATEX Slave Node #2: 
Circuit (Local Copy) 


Global Transitions Spots 
(Local Copy of GTSj 


Local Transition Spots 
(LTS) #2 


Circuit Solver 


Local Results 


v^it5 

bac< 


MATEX Scheduler: 
Circuit 


Global Transitions Spots 
(GTS) 


Local Transition Spots 
(LTS) #1 


Local Transition Spots 
(LTS) #2 


Local Transition Spots 
(LTS) #K 


Local Results Collection 


V vv 


Global Results 


Superposition based on 
GTS 


MATEX Slave Node #K: 
Circuit (Local Copy) 


Global Transitions Spots 
(Local Copy of GTS) 


Local Transition Spots 
(LTS) #K 


Circuit Solver 


Local Results 


Fig. 11. DR-MATEX: The distributed MATEX framework using R-MATEX circuit solver. 


Algorithm 3: DR-MATEX: The distributed MATEX 
framework using R-MATEX at Node#/c. 


Input: LTS#k, GTS, P/c, Fk, error tolerance Etou and 
simulation time span T. 

Output: Local solution x along GTS in node 

k ^ [!,••• : *5], where S is the number of nodes 

1 t = 0, Xi = C + 7G, and X 2 = C; 

2 x(t) = Local_Initial_Solution; 

3 [L, U] = LU_Decompose(Xi); 

4 while t < T do 

5 Compute maximum allowed step size h based on 
GTS; 

6 if t G LTS#k then 

/-k Generate Krylov subspace for the 
point at LTS^k and compute x(t +/i) 

k / 

1 [x(t + /i), v] = 

MATEX_Arnoldi(L, U, X 2 , 
h, t, x(t), e, P/c(t, h),Fk{t, h)); 


8 

9 

10 


(^Its — 

end 

else 


/k Obtain x(t + /i) at Snapshot with 


computed Krylov subspace */ 


11 

12 

13 

14 

15 end 


ha — t E h dits 9 
+ h) = ||v||Vme 

end 

t = t E h; 


hair 77 


P/c (^5 9 


C. Runtime Analysis of MATEX PDN Solver 

Suppose we have the dimension of Krylov subspace basis 
m on average for each time step and one pair of forward and 
backward substitutions consumes runtime The total time 
of serial parts is TgeriaU which includes matrix factorizations, 
result collection, etc. Eor x(t + h), the evaluation of matrix 
exponential with is T/y, which is in proportion to the 

time complexity Besides, we need extra T^ to form 

x(t + h), which is proportional to 0{nnn?) by 

Given K points of GTS, without decomposition of input 
sources, the runtime is 


KmTjjs + K{Th + Tq) + Tserial- (28) 

After dividing the input transitions and sending to enough 
computing nodes, we have k points of LTS for each node 
based on feature extraction and grouping (e.g., /c = 4 for one 
“bump” shape feature). The total computation runtime is 


kmTjjs + K{Th + Te) + TgeriaU (29) 


where K{Th + Tg) contains the portion of computing 
Snapshot in DR-MATEX mode. The speedup of DR-MATEX 
over single MATEX is 


Speedup 


KmTjjs + K{Th + Te) -h Tgerial 
kmTi)s + K(Th + Tg) + Tgerial 


(30) 


Eor R-MATEX, we have small m. Besides, is relatively 
larger than {Th + Tg) in our targeted problem. Therefore, the 
most dominating part is the KmTijs in Eq. (1^ . We can always 
decompose input source transitions, and make k smaller than 

K. 

In contrast, suppose the traditional method with fixed step 
size has N steps for the entire simulation, the runtime is 


NTjjs “h Tgerial • 
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TABLE I 

Specifications of IBM power grid benchmarks. 


Design 

#R 

#C 

#L 

#I 

#V 

#Nodes 

ibmpglt 

41K 

IIK 

277 

IIK 

14K 

54K 

ibmpgZt 

245K 

37K 

330 

37K 

330 

165K 

ibmpgSt 

1.6M 

201K 

955 

201K 

955 

l.OM 

ibmpg4t 

1.8M 

266K 

962 

266K 

962 

L2M 

ibmpgSt 

1.6M 

473K 

277 

473K 

539K 

2.1M 

ibmpgbt 

2.4M 

761K 

381 

761K 

836K 

3.2M 


Then, the speedup of distributed DR-MATEX over the tradi¬ 
tional method is 

Speedup' = , ^ rr. -• (31) 

kmTi)s + K(Th + Tq) -h Tserial 

Note that, when the minimum distance among input source 
breakpoints decreases, large time span or many cycles is 
required to simulate PDNs, the schemes with such uniform 
step size would degrade runtime performance furthermore due 
to the increase of N. In contrast, in MATEX PDN solver, 
K is not so sensitive to such constraints. Besides, k can be 
maintained in a small number based on the decomposition 
strategy. Therefore, the speedups of our proposed methods 
tend to be larger when the simulation requirements become 
harsher. 

V. Experimental Results 

We implement all the algorithms in MATLAB R2014l£| 
and use UMEPACK package for LU factorization. Eirst, we 
compare I-MATEX, R-MATEX and TR in order to show our 
runtime improvements in single machine framework in Table 
mi Second, we show our distributed framework DR-MATEX 
achieves large speedups in Table |III1 The experiments are 
conducted on the server with Intel(R) Xeon (R) E5-2640 v3 
2.60GHz processor and 125GB memory. 

A. Performance of I-MATEX and R-MATEX in Sec. \IV-A\ 

We compare our proposed I-MATEX and R-MATEX against 
the popular TR-ETS on the IBM power grid benchmarks ll22l . 
Among the current sources, the smallest interval between two 
breakpoints is hupper = lOps, which puts the upper limit 
of the TR’s step size. All of these cases have very large 
numbers of input current sources. Table [I] shows the details of 
each benchmark circuit of which size ranges from 54K up to 
3.2M. The simulation time is 10ns. Erom ibmpglt to ibmpgbt, 
TR uses fixed step size in lOps. We also change the IBM 
power grid benchmark to make the smallest distance among 
breakpoints Ips by interleaving input sources’ breakpoints 
(similar as Eig. [T]). Therefore, the fixed step size method 
can only use at most Ips. The names of those benchmarks 
are ibmpglt_new, ibmpg2t_new, ibmpg3t_new, ibmpg4t_new, 
ibmpg5t_new and ibmpg6t_new. 

After DC analysis in TR-ETS, we LU factorize matrix 
once for the later transient simulation, which only contains 

^Measurements reported are on MATLAB implementations. They are 
subject to limitations and are not directly comparable to C++ implementations 
reported in literature such as (44). 


time stepping. Actually, multiple factorized matrices can be 
deployed EH, lES). We can choose one of them during the 
stepping. The problem is the memory and runtime overhead 
for the multiple matrix factorizations. Another point is if large 
time step h' is chosen, the standard low order scheme cannot 
maintain the accuracy. 

Experiment is conducted on a single computing node. In 
Table im we record the total simulation runtime Total(s), 
which includes the processes of DC and transient simulation, 
but excludes the non-numerical computation before DC, e.g., 
netlist parsing and matrix stamping. We also record the part 
of transient simulation Tran(s), excluding DC analysis and 
LU decompositions. The speedup of I-MATEX is not as 
large as R-MATEX, because I-MATEX with a large spec¬ 
trum of A generates large dimension m of Krylov subspace. 
Meanwhile, the step size is not large enough to let it fully 
harvest the gain from time marching with stepping. In contrast, 
R-MATEX needs small dimension numbers m of rational 
Krylov subspace, which ranges from 2 to 8 in those cases. 
Therefore, they can benefit from large time stepping, shown as 
SPDPJ’^. Eor ibmpg4t, R-MATEX achieves maximum speedup 
resulted from the relatively small number of breakpoints in that 
benchmark, which is around 44 points, while the majority of 
others have over 140 points. 

In Table im our single mode R-MATEX achieves the average 
speedup 5x over TR-ETS. Note the average speedup number 
of single mode R-MATEX over TR-ETS for the original IBM 
benchmark (ibmpglt^ibmpgbt) is less than the speedup of 
the new test cases (ibmpglt_new^ibmpg6t_new). As we men¬ 
tioned before, ibmpglt_new^ibmpg6t_new have harsher input 
constraints, making the available step size only Ips. Therefore, 
the adaptive stepping by R-MATEX is more beneficial to 
the runtime performance in ibmpglt_new~ibmpg6t_new than 
ibmpglt ibmpgbt. 

B. Performance of DR-MATEX in Sec. \IV-B\ 

We test our distributed DR-MATEX in the following ex¬ 
periments with the same IBM power grid benchmarks. These 
cases have many input transitions {GTS) that limit step sizes 
of R-MATEX. We divide the region before the computation of 
simulation. We decompose the input sources by the approach 
discussed in Sec. IIV-B3I and obtain much fewer transitions 
of LTS for computing nodes. The original input source 
numbers are over ten thousand in the benchmarks. However, 
based on “bump” feature (as shown in Eig. (TO]), we obtain 
a fairly small numbers for each computing node, which is 
shown as Group # in Table [IIIl (Now, the fact that hundred 
machines to process in parallel is quite normal El, lEl in 
the industry.) We pre-compute GTS and LTS groups and 
assign sub-tasks to corresponding node^. MATEX scheduler 
is only responsible for simple superposition calculation at the 
end of simulation. Since the slave nodes are in charge of all 
the computing procedures (Eig. HB for the computation of 

^ Based on the feature of input sources available, the preprocessing is very 
efficient, which takes linear time complexity to obtain GTS, LTS and separates 
the sources into different groups. 
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TABLE II 

Performance COMPARISONS (single computing node): TR-FTS, I-MATEX, and R-MATEX.DC(s): Runtime of DC analysis (seconds); m/: 
The maximum m of Krylov subspace in I-MATEX. Tran(s): Runtime of transient simulation after DC (seconds), excluding the 

MATRIX FACTORIZATION RUNTIME; Total(s): RUNTIME OF OVERALL TRANSIENT SIMULATION (SECONDS); Df(uV): MAXIMUM AND AVERAGE VOLTAGE 
DIFFERENCES COMPARED TO PROVIDED SOLUTIONS (UV); UIr: THE MAXIMUM m OF KRYLOV SUBSPACE IN R-MATEX : SPEEDUP OF 

R-MATEX OVER TR-FTS WITH RESPECT TO Tran(s); SPDPJ': Speedup of R-MATEX over I-MATEX with respect to Tran(s). 


Design 

DC(s) 

TR-FTS 

I-MATEX 

R-MATEX 

Speedups 

Tran(s) 

Total(s) 

m/ 

Tran(s) 

Total(s) 

Df(uV) 

rriR 

Tran(s) 

Total(s) 

Df(uV) 

SPDPfr 

SPDPJ 

ibmpglt 

0.2 

5.7 

6.00 

30 

28.8 

28.9 

58\9.8 

5 

10.1 

10.3 

45\6.8 

0.6x 

2.9x 

ibmpg2t 

0.8 

40.0 

41.9 

28 

130.0 

130.9 

92\10.5 

5 

35.6 

37.4 

45\6.8 

l.lx 

3.7x 

ibmpgSt 

16.4 

263.2 

295.0 

29 

1102.5 

1115.1 

95\20.4 

5 

275.5 

301.0 

95\18.5 

l.Ox 

4.0X 

ibmpg4t 

13.5 

460.8 

501.9 

29 

433.8 

458.2 

101\39.3 

5 

200.5 

239.1 

99\34.2 

2.3x 

2.2x 

ibmpgSt 

9.0 

476.6 

498.0 

30 

1934.4 

1944.5 

29\5.6 

5 

383.1 

401.9 

29\4.4 

1.2x 

5.0X 

ibmpgbt 

15.3 

716.0 

749.1 

25 

2698.9 

2713.7 

39\8.6 

5 

773.5 

800.5 

33\5.6 

0.9x 

3.5x 

ibmpglt new 

0.2 

51.3 

51.7 

30 

27.2 

27.4 

58\9.8 

5 

11.7 

12.1 

53\6.9 

4.4x 

2.3x 

ibmpg2t new 

0.9 

431.4 

433.5 

28 

114.9 

115.7 

49\10.5 

5 

43.3 

44.9 

33\5.6 

lO.Ox 

2.7x 

ibmpg3t new 

16.3 

3716.5 

3749.0 

29 

1219.3 

1232.6 

95\20.4 

5 

481.7 

508.2 

95\18.9 

7.7x 

2.5x 

ibmpg4t new 

18.3 

5044.6 

5085.3 

29 

753.5 

776.4 

101\39.3 

6 

350.9 

387.2 

99\34.2 

14.4X 

2.lx 

ibmpg5t new 

10.5 

5065.9 

5110.1 

30 

2494.0 

2504.7 

30\5.6 

5 

746.2 

766.4 

30\4.4 

6.8x 

3.3x 

ibmpg6t new 

13.1 

7015.3 

7059.7 

25 

3647.9 

3663.1 

39\8.6 

6 

895.1 

923.1 

33\7.3 

7.8x 

4.1x 

Average 

— 

— 

— 

— 

— 

— 

65\15.7 

— 

— 

— 

57\12.8 

5x 

3x 


their own transient simulation tasks, and have no commu¬ 
nications with others, our framework falls into the category 
of Embarrassingly Parallelism model. We can easily emulate 
the multiple-node environment. We simulate each group using 
the command “matlab -singleCompThread” in our server. We 
record the runtime numbers for each process (slave nodes) and 
report the maximum runtime as the total runtime “Total(s)” 
of DR-MATEX in Table Hill We also record “pure transient 
simulation” as “Tran(s)”, which is the maximum runtime of 
the counterparts among all computing nodes. 

For TR-FTS, we use h = lOps, so there are 1,000 pairs 
of forward and backward substitutions during the process 
of pure transient simulation for ibmpglt^ibmpght; We use 
h = Ips for ibmpglt_new^ibmpg 6 t_new. Therefore, we 
have 10,000 pairs of forward and backward substitutions for 
stepping. In DR-MATEX, the circuit solver uses R-MATEX 
with 7 = 10 “^^, which is set to sit among the order of varied 
time steps during the simulation (since Sec. IIII-El discusses the 
insensitivity of 7 around the step size of interests). TR-FTS is 
not distributed because it has no gain by dividing the current 
source as we do for the DR-MATEX. TR-FTS cannot avoid the 
repeated pairs of forward and backward substitutions. Besides, 
adaptive stepping for TR-FTS only degrades the performance, 
since the process requires extra matrix factorizations. 

In Table Hill our distributed mode gains up to 98 x for 
the pure transient computing. The average peak dimension 
m of rational Krylov subspace is 7. The memory overhead 
ratio for each node (around 1.6 x over TR-FTS on average) is 
slightly larger, which is worthwhile with respect to the large 
runtime improvement. With the huge reduction of runtime for 
Krylov subspace generations, the serial parts, including LU 
and DC, play more dominant roles in DR-MATEX, which 
can be further improved using advance matrix solvers, such 
as ll58]| . 

VI. Conclusions and Future Directions 

In this work, we propose an efficient framework MATEX for 
accurate PDN time-domain simulation based on the exponen¬ 


tial integration scheme. We visualize the error distributions to 
show the advantages of using rational (R-MATEX) and invert 
(I-MATEX) Krylov subspace methods for matrix exponential 
and vector product (ME VP) over standard Krylov sub space 
method (MEXP). For the PDN simulation, our time integration 
scheme can perform adaptive time stepping without repeating 
matrix factorizations, which cannot be achieved by traditional 
methods using implicit numerical integration with fixed time- 
step scheme. Compared to the commonly adopted framework 
TR with fixed time step (TR-FTS), our single mode framework 
(R-MATEX) gains runtime speedup up to around 15 x. We also 
show that the distributed MATEX framework (DR-MATEX) 
leverages the superposition property of linear system and 
decomposes the task based on the feature of input sources, 
so that we reduce chances of Krylov subspace generations 
for each node. We achieve runtime improvement up to 98 x 
speedup. 

We show the exponential integration with Krylov subspace 
methods maintains high order accuracy and fiexible time 
stepping ability. The exponential integration framework was 
actually mentioned by the very early work in circuit simulation 
algorithms 1291 , but it had not attracted too much attention due 
to the high computational complexities of matrix exponential 
during that time. Nowadays, the progress of Krylov subspace 
methods provides efficient way to compute matrix exponential 
and vector product, so that we can utilize certain features of 
exponential integration, which are hardly obtained by tradi¬ 
tional time integration schemes. Exponential integration can 
also serve as stable explicit schemes EH, m for general 
dynamical systems. It is a promising framework for the future 
circuit simulation algorithms and software. The opportunities 
of parallel and distributed computing with the cutting-edge 
multi-core and many-core hardware are also worth exploring 
for the further parallelism and runtime improvement. 
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TABLE III 

The performance of DR-MATEX (Distributed R-MATEX). Group #: Group number of the testcases. This number represents the 

TOTAL NUMBER OF SIMULATION SUB-TASKS FOR THE DESIGN; Tran(s): RUNTIME OF TRANSIENT SIMULATION AFTER DC (SECONDS); Total(s) I 

Runtime of overall transient simulation (seconds); Max. Df.(V) and Avg. Df.(V): maximum and average differences compared to the 
SOLUTIONS OF ALL OUTPUT NODES PROVIDED BY IBM POWER GRID BENCHMARKS. SFDVtr' SPEEDUP OVER TR-FTS ’ S Tran(s) IN TABLEflll SPDPrl 
Speedup over R-MATEX’s Tran(s) in TABLEflll Peak m: the peak dimension used in DR-MATEX for ME VP; Mem. Ratio over TR-FTS: The 
PEAK MEMORY COMPARISON BETWEEN THE MAXIMUM MEMORY CONSUMPTION OF DR-MATEX OVER TR-FTS IN TABLEflll 


Design 

DR-MATEX 

Speedups 

Peak 

m 

Mem. Ratio 
over TR-FTS 

Group # 

Tran(s) 

Total(s) 

Max Df.(V) 

Avg Df.(V) 

SPDPtr 

SPDPr 

ibmpglt 

100 

1.4 

1.9 

5.3e-5 

8.6e-6 

4.0x 

7.1X 

6 

1.9 

ibmpg2t 

100 

8.9 

11.4 

4.6e-5 

8.6e-6 

4.5x 

4.0x 

7 

1.9 

ibmpg3t 

100 

91.7 

129.9 

9.6e-5 

19.7e-6 

2.9x 

4.4x 

6 

1.5 

ibmpg4t 

15 

52.3 

112.2 

9.9e-5 

27.9e-6 

8.8x 

3.8x 

8 

1.4 

ibmpg5t 

100 

148.4 

178.9 

9.0e-5 

l.le-6 

3.2x 

2.6x 

7 

1.5 

ibmpg6t 

100 

189.9 

234.2 

3.4e-5 

7.2e-6 

3.8x 

4.1X 

7 

1.5 

ibmpglt new 

100 

2.4 

2.8 

5.3e-5 

8.6e-6 

21.8x 

5.0X 

6 

1.9 

ibmpg2t new 

100 

5.6 

7.0 

4.6e-5 

8.6e-6 

61.6X 

6.2x 

7 

1.9 

ibmpg3t new 

100 

103.0 

140.9 

9.8e-5 

19.9e-6 

25.6x 

3.3x 

7 

1.5 

ibmpg4t new 

15 

51.5 

108.4 

9.9e-5 

27.6e-6 

98.Ox 

6.8x 

8 

1.4 

ibmpg5t new 

100 

185.6 

227.8 

9.9e-5 

2.2e-6 

27.3X 

4.0x 

7 

1.5 

ibmpg6t new 

100 

274.8 

317.7 

3.4e-5 

7.1e-6 

25.5x 

3.3x 

7 

1.5 

Average 

— 

— 

— 

7.1e-5 

12.3e-6 

26 X 

5x 

6.7 

1.6 
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